library(tidyverse)
library(DPpackage)
## library(MCMCpack)
The previous example of the Dirichlet process was based on discrete data with a discrete centering measure \(G_{0}\), so it was no surprise that the posterior gave us discrete measures \(G\). However, the Dirichlet process yields measures \(G\) that are discrete almost surely (with probability one) regardless of the centering measure \(G_{0}\). For density estimation, therefore, it is more appealing to add one more layer to the model and use the DP draw as a mixing distribution.
\[\begin{align*} y_{i} | \theta_{i} &\overset{\text{ind}}{\sim} f_{\theta_{i}}\\ \\ \theta_{i} | G &\overset{\text{iid}}{\sim} G\\ \\ G &\sim DP(M G_{0}) \end{align*} \]
This is interpreted as follows.
Each data point \(y_{i}\) has its own latent parameter value \(\theta_{i}\). Given this latent parameter value, \(y_{i} | \theta_{i}\) follows a parametric distribution with a density function \(f_{\theta_{i}}\).
In turn, each individual-specific latent parameter value \(\theta_{i}\) given a random probability measure \(G\) is an iid sample from the distribution defined by \(G\).
The distribution over the random measure \(G\) is governed by a Dirichlet process with a mass (concentration) parameter \(M\) and a centering measure \(G_{0}\), where \(G_{0}\) should have a support appropriate for \(\theta_{i}\).
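Before moving on, this prior can be simulated directly. Below is a minimal base-R sketch of one draw from the DP mixture prior, using a truncated stick-breaking representation with \(G_{0} = N(0,4)\); the truncation level `K`, the kernel standard deviation `0.1`, and `M = 1` are illustrative choices, not values from the text.

```r
## One draw from the DP mixture prior via truncated stick breaking.
set.seed(1)
M <- 1     # mass (concentration) parameter
K <- 100   # truncation level for the stick-breaking representation
v <- rbeta(K, 1, M)                  # stick-breaking fractions
w <- v * cumprod(c(1, 1 - v[-K]))    # mixture weights; sum is essentially 1
theta <- rnorm(K, mean = 0, sd = 2)  # atoms drawn iid from G0 = N(0, 4)
## Sample each y_i: pick an atom by weight, then draw from the kernel f_theta.
idx <- sample.int(K, size = 500, replace = TRUE, prob = w)
y <- rnorm(500, mean = theta[idx], sd = 0.1)
```

Because the atoms `theta` are reused across observations, draws of `y` cluster around a few values, which is exactly the behavior the mixture layer smooths into a density.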
Given the structure above and the vector notation \(\boldsymbol{\theta} = (\theta_{1},\dots,\theta_{n})\) and \(\mathbf{y} = (y_{1}, \dots, y_{n})\), the posterior distribution of the random measure \(G\) given \(\boldsymbol{\theta}\) is the following.
\[\begin{align*} G | \boldsymbol{\theta} &\sim DP \left( MG_{0} + \sum^{n}_{i=1} \delta_{\theta_{i}} \right) \end{align*} \]
Now consider the posterior distribution of \(\boldsymbol{\theta}\) given \(\mathbf{y}\), \(p(\boldsymbol{\theta} | \mathbf{y})\). If we integrate \(\boldsymbol{\theta}\) out of the above expression with respect to this posterior distribution, we obtain the following mixture of Dirichlet processes.
\[\begin{align*} G | \mathbf{y} &\sim \int DP \left( MG_{0} + \sum^{n}_{i=1} \delta_{\theta_{i}} \right) \text{d} p(\boldsymbol{\theta} | \mathbf{y}) \end{align*} \]
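Marginalizing \(G\) out instead gives the familiar Pólya-urn predictive: \(\theta_{n+1}\) is a fresh draw from \(G_{0}\) with probability \(M/(M+n)\), and equals an existing \(\theta_{i}\) with probability \(1/(M+n)\) each. A base-R sketch (the particular \(\theta\) values and \(M = 1\) are made up for illustration):

```r
## Polya-urn predictive draw for theta_{n+1} given theta_1, ..., theta_n.
set.seed(2)
M <- 1
theta <- c(-1.2, 0.4, 0.4, 2.0)  # hypothetical current values of theta
n <- length(theta)
draw_next <- function() {
  if (runif(1) < M / (M + n)) {
    rnorm(1, mean = 0, sd = 2)   # new value from G0 = N(0, 4)
  } else {
    sample(theta, size = 1)      # tie to an existing theta_i
  }
}
new_theta <- replicate(1000, draw_next())
mean(new_theta %in% theta)       # fraction of draws tied to existing values
```

The tied fraction should be close to \(n/(M+n) = 0.8\) here, which is why a small \(M\) favors few clusters.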
This is an example from the BNPDA book (p. 13).
## https://web.ma.utexas.edu/users/pmueller/bnp/R/DP/eig121/EIG.txt
data1 <- read.table(header = TRUE, text = "
id recorded-EIG121 imputed-EIG121
1 NA 0.00941
2 0.00533 0.00533
3 0.02310 0.02310
4 NA 0.67871
5 0.00008 0.00008
6 0.27840 0.27840
7 0.06530 0.06530
8 NA 0.03108
9 NA 0.03108
10 0.00060 0.00060
11 0.02750 0.02750
12 NA 0.03108
13 0.03060 0.03060
14 NA 0.03108
15 NA 0.67871
16 0.03300 0.03300
17 NA 0.03108
18 NA 0.03108
19 NA 0.67871
20 0.09650 0.09650
21 0.02560 0.02560
22 0.01670 0.01670
23 0.06980 0.06980
24 0.00730 0.00730
25 NA 0.03108
26 NA 0.77071
27 NA 0.03108
28 0.00500 0.00500
29 NA 0.03108
31 0.00891 0.00891
32 NA 0.22861
33 NA 0.22861
34 0.02630 0.02630
35 NA 0.22861
36 NA 0.03108
37 NA 0.03108
38 NA 0.03108
39 NA 0.00941
40 0.01357 0.01357
41 0.01759 0.01759
42 0.83840 0.83840
43 1.18167 1.18167
44 0.01540 0.01540
45 0.67239 0.67239
46 0.09752 0.09752
47 0.16455 0.16455
48 0.55606 0.55606
49 0.64961 0.64961
50 0.01287 0.01287
51 0.00192 0.00192
52 0.00043 0.00043
53 0.01897 0.01897
54 0.00036 0.00036
55 0.23272 0.23272
56 0.43327 0.43327
57 1.10649 1.10649
58 0.85073 0.85073
59 0.00318 0.00318
60 0.03394 0.03394
61 0.13869 0.13869
62 0.00345 0.00345
63 0.22849 0.22849
64 0.28294 0.28294
65 0.75363 0.75363
66 0.18504 0.18504
67 0.01930 0.01930
68 1.05398 1.05398
69 0.02071 0.02071
70 0.00295 0.00295
71 0.00394 0.00394
72 0.32358 0.32358
73 0.01006 0.01006
")
As outlined in the program linked above, we will use the following model.
\[\begin{align*} y_{i} | \theta_{i} &\overset{\text{ind}}{\sim} N(\theta_{i}, \sigma^{2})\\ \\ \theta_{i} | G &\overset{\text{iid}}{\sim} G\\ \\ G &\sim DP(M, N(0,4))\\ \\ \frac{1}{\sigma} &\sim Gamma(1, 1)\\ \end{align*} \]
The model implemented in DPdensity is the following. See the package manual for details.
\[\begin{align*} y_{i} | (\mu_{i}, \Sigma_{i}) &\overset{\text{ind}}{\sim} N(\mu_{i},\Sigma_{i})\\ \\ (\mu_{i}, \Sigma_{i}) | G &\overset{\text{iid}}{\sim} G\\ \\ G | (\alpha, G_{0}) &\sim DP(\alpha G_{0})\\ \\ G_{0} &= N \left( \mu \bigg| m_{1}, \frac{1}{k_{0}} \Sigma \right) IW \left( \Sigma | \nu_{1}, \psi_{1} \right)\\ \\ \alpha | (a_{0},b_{0}) &\sim Gamma(a_{0},b_{0})\\ m_{1} | (m_{2},s_{2}) &\sim N(m_{2},s_{2})\\ k_{0} | (\tau_{1},\tau_{2}) &\sim Gamma \left( \frac{\tau_{1}}{2}, \frac{\tau_{2}}{2} \right)\\ \psi_{1} | (\nu_{2}, \psi_{2}) &\sim IW(\nu_{2}, \psi_{2})\\ \end{align*} \]
mcmc <- list(nburn = 1000,
nsave = 10000,
nskip = 10,
ndisplay = 100)
## Example of Prior information 1
## Fixing alpha, m1, and Psi1
prior1 <- list(alpha=1,m1=rep(0,1),psiinv1=diag(0.5,1),
nu1=4, tau1=1,tau2=100)
## Example of Prior information 2
## Fixing alpha and m1
prior2 <- list(alpha=1,m1=rep(0,1),
psiinv2=solve(diag(0.5,1)),
nu1=4,nu2=4,tau1=1,tau2=100)
## Example of Prior information 3
## Fixing only alpha
prior3 <- list(alpha=1,
m2=rep(0,1),s2=diag(100000,1),
psiinv2=solve(diag(0.5,1)),
nu1=4,nu2=4,tau1=1,tau2=100)
## Example of Prior information 4
## Everything is random (alpha, m1, and Psi1)
prior4 <- list(a0=2,b0=1,m2=rep(0,1),s2=diag(100000,1),
psiinv2=solve(diag(0.5,1)),
nu1=4,nu2=4,tau1=1,tau2=100)
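For intuition about what these settings imply, here is a hedged one-dimensional sketch of a single draw from \(G_{0}\) under the values fixed in `prior1` (`psiinv1 = 0.5`, so \(\psi_{1} = 2\); \(\nu_{1} = 4\); \(\tau_{1} = 1\), \(\tau_{2} = 100\)). It assumes the shape/rate convention under which a 1-D inverse Wishart \(IW(\nu, \psi)\) is an inverse gamma with shape \(\nu/2\) and rate \(\psi/2\); check the DPpackage manual before relying on the exact parametrization.

```r
## One draw of (mu, Sigma) from G0 under prior1, in one dimension.
set.seed(3)
nu1  <- 4
psi1 <- 1 / 0.5                  # prior1 fixes psiinv1 = 0.5, so Psi1 = 2
m1   <- 0
k0   <- rgamma(1, shape = 1 / 2, rate = 100 / 2)          # tau1/2, tau2/2
Sigma <- 1 / rgamma(1, shape = nu1 / 2, rate = psi1 / 2)  # assumed 1-D IW
mu    <- rnorm(1, mean = m1, sd = sqrt(Sigma / k0))
c(mu = mu, Sigma = Sigma)
```

Note how a small draw of `k0` inflates the variance of `mu` around `m1`, which is the role the \(Gamma(\tau_{1}/2, \tau_{2}/2)\) hyperprior plays in the hierarchy.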
## Fit the models
state <- NULL
fit1.1 <- DPdensity(y=data1$imputed.EIG121,
prior=prior1,
mcmc=mcmc,
state=state,
status=TRUE)
fit1.2 <- DPdensity(y=data1$imputed.EIG121,
prior=prior2,
mcmc=mcmc,
state=state,
status=TRUE)
fit1.3 <- DPdensity(y=data1$imputed.EIG121,
prior=prior3,
mcmc=mcmc,
state=state,
status=TRUE)
fit1.4 <- DPdensity(y=data1$imputed.EIG121,
prior=prior4,
mcmc=mcmc,
state=state,
status=TRUE)
Check posterior means.
fit1.1
##
## DPM of normals model for density estimation
##
## Call:
## DPdensity.default(y = data1$imputed.EIG121, prior = prior1, mcmc = mcmc,
## state = state, status = TRUE)
##
## Posterior Predictive Distributions (log):
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.734675 -0.007213 0.816516 0.229982 0.876233 0.914218
##
## Posterior Inference of Parameters:
## k0 ncluster
## 0.03179 2.52330
##
## Number of Observations: 72
## Number of Variables: 1
fit1.2
##
## DPM of normals model for density estimation
##
## Call:
## DPdensity.default(y = data1$imputed.EIG121, prior = prior2, mcmc = mcmc,
## state = state, status = TRUE)
##
## Posterior Predictive Distributions (log):
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.6079 -0.8741 2.1831 0.9550 2.3783 2.7068
##
## Posterior Inference of Parameters:
## k0 psi1-data1 ncluster
## 0.02394 427.65153 5.55130
##
## Number of Observations: 72
## Number of Variables: 1
fit1.3
##
## DPM of normals model for density estimation
##
## Call:
## DPdensity.default(y = data1$imputed.EIG121, prior = prior3, mcmc = mcmc,
## state = state, status = TRUE)
##
## Posterior Predictive Distributions (log):
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -5.8361 -0.8339 2.1825 0.9013 2.3796 2.6754
##
## Posterior Inference of Parameters:
## m1-data1 k0 psi1-data1 ncluster
## 0.18102 0.01754 474.09594 6.64690
##
## Number of Observations: 72
## Number of Variables: 1
fit1.4
##
## DPM of normals model for density estimation
##
## Call:
## DPdensity.default(y = data1$imputed.EIG121, prior = prior4, mcmc = mcmc,
## state = state, status = TRUE)
##
## Posterior Predictive Distributions (log):
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -468.4270 -0.7128 2.2117 -21.4283 2.4529 2.5927
##
## Posterior Inference of Parameters:
## m1-data1 k0 psi1-data1 ncluster alpha
## 0.312830 0.007534 3201.080112 12.233800 3.446769
##
## Number of Observations: 72
## Number of Variables: 1
Plot diagnostics.
plot(fit1.1,ask=FALSE,output="param")
plot(fit1.2,ask=FALSE,output="param")
plot(fit1.3,ask=FALSE,output="param")
plot(fit1.4,ask=FALSE,output="param")
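The estimated density itself can also be inspected. The plot method's `output` argument accepts `"density"` as well as the `"param"` value used above (see the DPdensity manual); for example, for the prior 2 fit:

```r
## Posterior predictive density estimate (prior 2 only for illustration).
plot(fit1.2, ask = FALSE, output = "density")
```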
Plot specific aspects.
## Extracting the posterior mean of the specific
## means and covariance matrices
## (only prior 2 for illustration)
DPrandom(fit1.2)
##
## Random effect information for the DP object:
##
## Call:
## DPdensity.default(y = data1$imputed.EIG121, prior = prior2, mcmc = mcmc,
## state = state, status = TRUE)
##
##
## Posterior mean of subject-specific components:
##
## mu-data1 var-data1
## 1 0.0193149 0.0003132
## 2 0.0196523 0.0003719
## 3 0.0195060 0.0002753
## 4 0.7278682 0.0338056
## 5 0.0196232 0.0003572
## 6 0.2668811 0.0086877
## 7 0.0822295 0.0028461
## 8 0.0202380 0.0003070
## 9 0.0204050 0.0003498
## 10 0.0192207 0.0003046
## 11 0.0200893 0.0003247
## 12 0.0200863 0.0003103
## 13 0.0204879 0.0003364
## 14 0.0204447 0.0003411
## 15 0.7277419 0.0338248
## 16 0.0206074 0.0003650
## 17 0.0206461 0.0003712
## 18 0.0202675 0.0003533
## 19 0.7280343 0.0338045
## 20 0.1292736 0.0044645
## 21 0.0198303 0.0003246
## 22 0.0190999 0.0002684
## 23 0.0952444 0.0035652
## 24 0.0191947 0.0003125
## 25 0.0205857 0.0003622
## 26 0.7371182 0.0347540
## 27 0.0202673 0.0003398
## 28 0.0193754 0.0003261
## 29 0.0205024 0.0003356
## 30 0.0191687 0.0002961
## 31 0.2342404 0.0050618
## 32 0.2338385 0.0049705
## 33 0.0196127 0.0002740
## 34 0.2342252 0.0050059
## 35 0.0205750 0.0003435
## 36 0.0204956 0.0003459
## 37 0.0204189 0.0003263
## 38 0.0191022 0.0002895
## 39 0.0192178 0.0002959
## 40 0.0194586 0.0002945
## 41 0.7462193 0.0353005
## 42 0.8664941 0.0346518
## 43 0.0195951 0.0003283
## 44 0.7279628 0.0339209
## 45 0.1309314 0.0045681
## 46 0.2110084 0.0068066
## 47 0.7143412 0.0345703
## 48 0.7267252 0.0338672
## 49 0.0193588 0.0003047
## 50 0.0192636 0.0003215
## 51 0.0193367 0.0003230
## 52 0.0194252 0.0002907
## 53 0.0194776 0.0003239
## 54 0.2351395 0.0050733
## 55 0.6209941 0.0329175
## 56 0.8653310 0.0346267
## 57 0.7493435 0.0353119
## 58 0.0191919 0.0002961
## 59 0.0206568 0.0003171
## 60 0.1838068 0.0065993
## 61 0.0194561 0.0003292
## 62 0.2346873 0.0051875
## 63 0.2702528 0.0091336
## 64 0.7352495 0.0347099
## 65 0.2217447 0.0059956
## 66 0.0194304 0.0002755
## 67 0.8605347 0.0347827
## 68 0.0195886 0.0003098
## 69 0.0192587 0.0003104
## 70 0.0192932 0.0002940
## 71 0.3484460 0.0167453
## 72 0.0190817 0.0002805
## Plotting predictive information about the specific
## means and covariance matrices
## with HPD and Credibility intervals
## (only prior 2 for illustration)
## (to see the plots gradually set ask=TRUE)
plot(DPrandom(fit1.2,predictive=TRUE),ask=FALSE)
plot(DPrandom(fit1.2,predictive=TRUE),ask=FALSE,hpd=FALSE)
## Plotting information about all the specific means
## and covariance matrices
## with HPD and Credibility intervals
## (only prior 2 for illustration)
## (to see the plots gradually set ask=TRUE)
plot(DPrandom(fit1.2),ask=FALSE,hpd=FALSE)
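The cluster partition behind these subject-specific means can also be examined. The DPdensity manual describes the saved MCMC state as including a cluster membership vector; assuming it is stored as `state$ss` (verify with `names(fit1.2$state)`), the partition at the last saved iteration can be tabulated:

```r
## Cluster sizes at the final saved MCMC state (prior 2 only).
## `ss` as the membership vector is an assumption; check names(fit1.2$state).
table(fit1.2$state$ss)
```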
print(sessionInfo())
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS 10.14.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] parallel stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] DPpackage_1.1-7.4 forcats_0.3.0 stringr_1.3.1 dplyr_0.7.7 purrr_0.2.5 readr_1.1.1
## [7] tidyr_0.8.2 tibble_1.4.2 ggplot2_3.1.0 tidyverse_1.2.1 doRNG_1.7.1 rngtools_1.3.1
## [13] pkgmaker_0.27 registry_0.5 doParallel_1.0.14 iterators_1.0.10 foreach_1.4.4 knitr_1.20
##
## loaded via a namespace (and not attached):
## [1] tidyselect_0.2.5 splines_3.5.1 haven_1.1.2 lattice_0.20-35 colorspace_1.3-2 htmltools_0.3.6
## [7] yaml_2.2.0 survival_2.43-1 rlang_0.3.0.1 pillar_1.3.0 glue_1.3.0 withr_2.1.2
## [13] readxl_1.1.0 modelr_0.1.2 bindrcpp_0.2.2 bindr_0.1.1 plyr_1.8.4 cellranger_1.1.0
## [19] munsell_0.5.0 gtable_0.2.0 rvest_0.3.2 codetools_0.2-15 evaluate_0.12 broom_0.5.0
## [25] Rcpp_0.12.19 xtable_1.8-3 backports_1.1.2 scales_1.0.0 jsonlite_1.5 hms_0.4.2
## [31] digest_0.6.18 stringi_1.2.4 rprojroot_1.3-2 grid_3.5.1 bibtex_0.4.2 cli_1.0.1
## [37] tools_3.5.1 magrittr_1.5 lazyeval_0.2.1 crayon_1.3.4 pkgconfig_2.0.2 Matrix_1.2-14
## [43] MASS_7.3-51 xml2_1.2.0 lubridate_1.7.4 rstudioapi_0.8 assertthat_0.2.0 rmarkdown_1.10
## [49] httr_1.3.1 R6_2.3.0 nlme_3.1-137 compiler_3.5.1
## Record execution time and multicore use
## (start_time is recorded near the top of the script)
end_time <- Sys.time()
diff_time <- difftime(end_time, start_time, units = "auto")
cat("Started ", as.character(start_time), "\n",
"Finished ", as.character(end_time), "\n",
"Time difference of ", diff_time, " ", attr(diff_time, "units"), "\n",
"Used ", foreach::getDoParWorkers(), " cores\n",
"Used ", foreach::getDoParName(), " as backend\n",
sep = "")
## Started 2018-11-23 06:01:40
## Finished 2018-11-23 06:02:58
## Time difference of 1.290052 mins
## Used 12 cores
## Used doParallelMC as backend